What Is Data?

Andy Grogan-Kaylor
2021-08-27

Introduction

A data set is nothing more than a series of rows and columns that contain responses to a survey:

Simulated Data

Both The Data And Documentation Are Useful

In working through our research questions, we’ll constantly be going back and forth between the actual data (to see the pattern of responses) and the documentation (to figure out the actual question asked, as well as how the different responses are coded).

Hypothetical Survey
Question 1 What neighborhood do you live in?
0-Neighborhood A
1-Neighborhood B
2-Other Neighborhood (please indicate)
-9-Don’t Know / Refused
Question 3 What is your income?
$__________ (annual number)
-9-Don’t Know / Refused
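
As a rough illustration of how documentation and data work together, the sketch below transcribes the Question 1 codes from the hypothetical survey into a lookup table and uses it to decode a column of responses. The names Q1_LABELS, q1_responses, and q1_decoded are made up for this example, and the response values are invented rather than taken from any real data set.

```python
# Hypothetical code book for Question 1, transcribed from the survey above.
# All names and values here are illustrative assumptions.
Q1_LABELS = {
    0: "Neighborhood A",
    1: "Neighborhood B",
    2: "Other Neighborhood",
    -9: "Don't Know / Refused",
}

# Made-up coded responses, as they might appear in one column of the data.
q1_responses = [0, 1, 1, -9, 2]

# Look up each numeric code in the code book to recover its meaning.
q1_decoded = [Q1_LABELS[code] for code in q1_responses]

for code, label in zip(q1_responses, q1_decoded):
    print(code, "->", label)
```

Without the code book, a value such as 1 or -9 in the data would be impossible to interpret, which is why both pieces are needed.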

Missing Values

Some cells of the table above contain a negative number. Negative numbers are frequently used to indicate what are called “missing values”. A missing value is a response like “don’t know”, “refused to answer”, or “did not answer”. Before we start doing calculations with our data, we’ll want to change negative numbers to true missing values (usually symbolized by a “.” or an “NA”), so that they don’t goof up our calculations.
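
To see why this recoding matters, here is a minimal sketch in Python. It assumes that -9 is the only missing-value code, as in the hypothetical survey, and it uses None as a stand-in for “NA”. The income values are invented for illustration.

```python
from statistics import mean

MISSING_CODE = -9  # the "Don't Know / Refused" code from the hypothetical survey

# Made-up annual incomes; two respondents did not answer.
incomes = [30000, 45000, -9, 52000, -9]

# Recode the missing-value code to None, Python's closest analogue to "NA".
recoded = [None if value == MISSING_CODE else value for value in incomes]

# Averaging the raw column treats -9 as a real income and drags the mean down.
naive_mean = mean(incomes)

# Averaging only the observed values gives the intended result.
observed = [value for value in recoded if value is not None]
clean_mean = mean(observed)

print(naive_mean)
print(clean_mean)
```

The naive mean is far lower than the mean of the three observed incomes, because each -9 is counted as if a respondent had reported an income of negative nine dollars.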

Variable Names Should Be Short

Often in a spreadsheet, you’ll see the full text of a question written out (e.g. “What neighborhood do you live in?”). Most programs that work with data are going to want abbreviations (e.g. “Q1” or “neighborhood”) for the questions. These abbreviations should usually have no spaces and be 8 characters or fewer.
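
One possible way to apply this is sketched below: a hand-written mapping from full question text to short names, plus a small check of the rule of thumb. The mapping RENAMES, the helper is_valid_name, and the short names “nbhd” and “income” are hypothetical choices for this example, not names from any real data set.

```python
# Hypothetical mapping from full question text to short variable names.
RENAMES = {
    "What neighborhood do you live in?": "nbhd",
    "What is your income?": "income",
}


def is_valid_name(name, max_len=8):
    """Check the rule of thumb: no spaces and at most max_len characters."""
    return " " not in name and 0 < len(name) <= max_len


# Column headers as they might appear in a spreadsheet.
header = ["What neighborhood do you live in?", "What is your income?"]

# Replace each full question with its short name.
short_header = [RENAMES[column] for column in header]

print(short_header)
print([is_valid_name(name) for name in short_header])
print(is_valid_name("What is your income?"))
```

Choosing the names by hand, rather than truncating the question text automatically, tends to give abbreviations that are still readable.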